RStudio Exercise 1: Tools and methods

The name of the course is Introduction to Open Data Science, and we focus on the R language, RStudio, GitHub and Markdown. You can find my GitHub repository here.


RStudio Exercise 2: Analysis

Introduction to the data

After the data wrangling exercise the new data set can be found in the data folder. The set is based on data collected on the course Introduction to Social Statistics, fall 2014 (taught in Finnish). The survey was conducted between 3.12.2014 and 10.1.2015 by Kimmo Vehkalahti.

student2014 <- read.table("data/learning2014.txt", header = TRUE)
dim(student2014)
## [1] 166   7

The student data includes 7 variables and 166 rows.

str(student2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The variables deep, stra and surf are combination variables built from the original survey data.

summary(student2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Of the 166 survey respondents, 56 were men and 110 were women. The mean age was 25.5 years. The oldest respondent was 55 and the youngest 17 years old.

Graphical output

The variables differ between genders: the distributions of age, attitude and surf (surface) are different. The three highest correlations between variables are:

  • points-attitude
  • surf-deep
  • surf-attitude

Explanatory variables

Three variables

## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = student2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
  • In this linear regression model points is the target variable and attitude, strategy (stra) and surface (surf) are the explanatory variables.

  • The residuals of the model range from about -17.2 to about 10.9, with a median of 0.52. I assume that the errors are normally distributed, but this needs to be confirmed.

  • Attitude is the only variable that is statistically significant in this model (p ≈ 1.9e-08). The p-values of stra (0.12) and surf (0.47) are too high for these variables to be even weakly significant.

  • The estimated coefficient of attitude is ~0.34 and its standard error is clearly smaller (~0.057). The other explanatory variables are not significant in this model.

  • The residual standard error (~5.3) is high in relation to the first and third quartiles of the residuals (-3.4 and 3.6).

  • This linear regression model with three explanatory variables explains ~19.3% of the variation in points (adjusted R-squared).
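As a quick check on the ~19.3% figure, the adjusted R-squared can be recomputed from the numbers printed in the summary. This is a minimal sketch using the standard adjustment formula; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary() output above
r_squared <- 0.2074   # Multiple R-squared
n <- 166              # number of observations
p <- 3                # explanatory variables: attitude, stra, surf

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```

The result should agree with the Adjusted R-squared line of the model output, up to rounding of the inputs.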

One variable

## 
## Call:
## lm(formula = points ~ attitude, data = student2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
  • The estimated coefficient of the explanatory variable attitude is ~0.35. This means that when attitude rises by one point, the predicted value of the target variable (points) increases by about 0.35 points.

  • The p-value of attitude (0.00000000412) shows that the variable is highly significant in this linear regression model.

  • This model explains 18.6% of the variation in points, the target variable (adjusted R-squared).
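To make the meaning of the slope concrete, the fitted line can be evaluated by hand. The sketch below copies the two coefficients from the output above; `predict_points` is a hypothetical helper introduced only for illustration.

```r
# Coefficients copied from the summary() output above
intercept <- 11.63715
slope <- 0.35255

# Hypothetical helper: predicted exam points for a given attitude score
predict_points <- function(attitude) intercept + slope * attitude

# Raising attitude by one point changes the prediction by exactly the slope
predict_points(31) - predict_points(30)

# Predictions for a few attitude scores inside the observed range (14 to 50)
predict_points(c(20, 30, 40))
```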

Graphical model validation

Residuals vs Fitted: This plot shows no pattern in the residuals, and their spread stays roughly constant across the fitted values. The assumption of constant variance of the errors therefore looks valid.

Normal Q-Q: The normal Q-Q plot shows that the errors of the model are approximately normally distributed.

Residuals vs Leverage: The last plot of the graphical model validation shows that the impact of any single observation is moderate. The model includes some outliers, but the leverage of a single observation doesn’t compromise the validity of the model.
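The code that drew these three plots is not shown in the report. A minimal sketch of how they can be produced with base R follows; the data are simulated stand-ins (the real analysis uses student2014), and the choice `which = c(1, 2, 5)` is an assumption that matches the three plot titles discussed above.

```r
# Simulated stand-in data; the real analysis uses student2014
set.seed(1)
sim <- data.frame(attitude = round(runif(100, min = 14, max = 50)))
sim$points <- 11.6 + 0.35 * sim$attitude + rnorm(100, sd = 5)

fit <- lm(points ~ attitude, data = sim)

# plot.lm panels: 1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 5))
```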


RStudio exercise 3: Logistic regression

Introduction to the data

Source: Using Data Mining To Predict Secondary School Student Alcohol Consumption. Fabio Pagnotta, Hossain Mohammad Amran, Department of Computer Science, University of Camerino

https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION

The data set combines the data sets of the Math course and the Portuguese language course. After the data wrangling exercise the new data set can be found in the data folder.

alc <- read.table("data/student_alc.txt", header = TRUE)
dim(alc)
## [1] 382  35

The student data includes 35 variables and 382 rows.

  • The variables not used for joining the two data have been combined by averaging (including the grade variables)
  • ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
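The two derived variables can be illustrated with a few made-up rows. This is only a sketch of the definitions in the list above, not the original wrangling script; `toy` and its values are invented for the example.

```r
# Made-up rows; Dalc = workday and Walc = weekend alcohol consumption
toy <- data.frame(Dalc = c(1, 2, 1, 4), Walc = c(1, 3, 3, 5))

toy$alc_use <- (toy$Dalc + toy$Walc) / 2   # average of the two measures
toy$high_use <- toy$alc_use > 2            # strictly greater than 2
toy
```

Note that a student with alc_use exactly 2 (third row) is not counted as a high user.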

High and low alcohol consumption and other variables

In this analysis I’m going to study the relationship between high/low alcohol consumption, sex and the following variables:

Variable    Type                  Description
age         numeric               student’s age
studytime   numeric, scale 1-4    weekly study time
freetime    numeric, scale 1-5    free time after school
absences    numeric               number of school absences

First these relationships are examined with tables and graphics. The hypotheses are as follows:

Age

The age at which alcohol consumption becomes high may differ between the sexes. Character develops differently in young people, and this may affect alcohol consumption habits.

H0: Age does not affect alcohol consumption

H1: The level of alcohol consumption differs with age

library(ggplot2)

# a jitter plot of high_use, sex and age
g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g1 + geom_jitter() + ggtitle("Age by alcohol consumption and sex")

Sex and age seem to be randomly distributed in both alcohol consumption groups.

#a boxplot of high_use, sex and age
g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g1 + geom_boxplot() + ggtitle("Age by alcohol consumption and sex") + xlab("High consumption group") + ylab("Age of student")

The boxplot suggests, however, that male students tend to be younger in the low consumption group, whereas female students tend to be younger in the high consumption group.

Study time

High alcohol consumption may be related to the time spent studying, because one can’t do both at the same time, at least not successfully.

H0: Alcohol consumption does not affect weekly study time

H1: Alcohol consumption affects weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

# bar plot of study time: share of high users within each study time level
ggplot(alc, aes(studytime, fill = high_use)) + geom_bar(position = "fill") +
  ggtitle("Bar plot of study time grouped by high_use") + xlab("Study time") + ylab("Proportion of students")

It seems that there are fewer high users among the students who spend more time studying (3-4) than among those who spend less time studying (1-2).

Free time

High alcohol consumption may be related to free time, so that students who have more free time consume more alcohol than students who have less free time.

H0: The amount of free time does not affect alcohol consumption among students

H1: The amount of free time affects alcohol consumption among students

# bar plot of free time: share of high users within each free time level
ggplot(alc, aes(freetime, fill = high_use)) + geom_bar(position = "fill") +
  ggtitle("Bar plot of free time grouped by high_use") + xlab("Free time") + ylab("Proportion of students")

It seems that the share of students with high alcohol consumption is larger in the groups that have more free time than in the groups that have less.

Absences

High consumption of alcohol may cause absences.

H0: High consumption of alcohol does not affect absences

H1: High consumption of alcohol does affect absences

#a boxplot of high_use and absences
g1 <- ggplot(alc, aes(x = high_use, y = absences, fill = high_use))
g1 + geom_boxplot() + ggtitle("Absences by alcohol consumption") + xlab("High consumption group") + ylab("Absences")

It seems that the differences in absences between the alcohol consumption groups are small. Absences are typically somewhat higher in the high consumption group, but the difference may not be significant.

Logistic regression model

#the model with glm()
m <- glm(high_use ~ age + studytime + freetime + absences, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + studytime + freetime + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9649  -0.8267  -0.6238   1.0990   2.2861  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.39240    1.79306  -2.450 0.014299 *  
## age          0.18543    0.10313   1.798 0.072183 .  
## studytime   -0.51842    0.15867  -3.267 0.001086 ** 
## freetime     0.33015    0.12430   2.656 0.007907 ** 
## absences     0.07856    0.02269   3.463 0.000535 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.18  on 377  degrees of freedom
## AIC: 433.18
## 
## Number of Fisher Scoring iterations: 4

From the fitted model one can see that all explanatory variables except age are statistically significant with p-values < 0.01. The variable absences is even significant with a p-value < 0.001. It seems that age (p ≈ 0.07) doesn’t explain whether or not a student is a high user of alcohol.

Odds ratio and confidence intervals

library(dplyr)  # provides %>% and mutate()

# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %    97.5 %
## (Intercept) 0.01237102 0.0003481843 0.3998333
## age         1.20373817 0.9852330457 1.4775972
## studytime   0.59545916 0.4321414645 0.8061535
## freetime    1.39118359 1.0938389338 1.7826010
## absences    1.08172440 1.0368164005 1.1336048

The odds ratios (OR) and their 95% confidence intervals (CI) show that each one-step increase in study time multiplies the odds of being a high user of alcohol by about 0.60. Put the other way round, a student one step lower on the study time scale has about 1.7 times the odds of being a high user. Students who have more free time are also more likely to be high users of alcohol (OR ~1.39 per step), and absences are positively associated with high use (OR ~1.08 per absence). The confidence intervals show that age is not statistically significant (because its interval contains 1), whereas the other variables are.
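The link between the model coefficients and the odds ratios can be checked by hand. The sketch below copies the estimates from summary(m) and exponentiates them; the object names are illustrative.

```r
# Estimates copied from summary(m) above (log-odds scale)
est <- c(age = 0.18543, studytime = -0.51842,
         freetime = 0.33015, absences = 0.07856)

# exp() turns a log-odds coefficient into the multiplicative
# change in odds per one-unit increase of the variable
odds_ratio <- exp(est)
round(odds_ratio, 3)

# For a variable with OR < 1, the reciprocal gives the change in
# odds for a one-unit decrease instead
round(1 / odds_ratio["studytime"], 2)
```

An odds ratio describes a change in odds, not in probability, so "1.7 times the odds" is not the same as "1.7 times as likely".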

Predictive power of the model

Predictive power of the final logistic regression model is calculated without the statistically insignificant variable age.

#the model with glm() and without the age variable  
m_final <- glm(high_use ~ studytime + freetime + absences, data = alc, family = "binomial")
summary(m_final)
## 
## Call:
## glm(formula = high_use ~ studytime + freetime + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9420  -0.8332  -0.6450   1.1266   2.1537  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.34105    0.56337  -2.380 0.017293 *  
## studytime   -0.50496    0.15691  -3.218 0.001290 ** 
## freetime     0.32626    0.12379   2.636 0.008401 ** 
## absences     0.08324    0.02262   3.680 0.000233 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 426.47  on 378  degrees of freedom
## AIC: 434.47
## 
## Number of Fisher Scoring iterations: 4
#predict and add the answer and the prediction to the data (alc)
probabilities <- predict(m_final, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)

#tabulate the target variable versus the prediction
table("High use" = alc$high_use, "Prediction" = alc$prediction)
##         Prediction
## High use FALSE TRUE
##    FALSE   254   14
##    TRUE     93   21

The table shows that the model produces 254 true negatives, 21 true positives, 14 false positives and 93 false negatives. This kind of table is sometimes called a “confusion table”.
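The counts in the table can be turned into a few summary rates. This is a small sketch that re-enters the table by hand (rows are the actual high_use values, columns the predictions); the names `conf`, `accuracy`, `sensitivity` and `specificity` are introduced here for illustration.

```r
# Counts copied from the table above: rows = actual high_use, columns = prediction
conf <- matrix(c(254, 93, 14, 21), nrow = 2,
               dimnames = list(actual = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(conf)) / sum(conf)                  # correct / all
sensitivity <- conf["TRUE", "TRUE"] / sum(conf["TRUE", ])   # high users found
specificity <- conf["FALSE", "FALSE"] / sum(conf["FALSE", ])# low users found

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, error = 1 - accuracy), 3)
```

With the 0.5 threshold the model recognises low users well but finds fewer than one in five of the actual high users.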

table("High use" = alc$high_use, "Prediction" = alc$prediction) %>% prop.table() %>% addmargins()
##         Prediction
## High use      FALSE       TRUE        Sum
##    FALSE 0.66492147 0.03664921 0.70157068
##    TRUE  0.24345550 0.05497382 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000

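The proportions of the same table show that 90.8% of the students are predicted to be low users (FALSE): 66.5% of all students are correctly predicted as low users, and 24.3% are high users that the model misses. 9.2% of the students are predicted to be high users (TRUE): 5.5% of all students really are high users, and 3.7% are not.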

#a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

Average number of wrong predictions

#defining a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

#calling loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2801047

The proportion of wrong predictions in the training data is about 28%.

Cross validation

#computing the average number of wrong predictions in the (training) data
#loss_func(class = alc$high_use, prob = alc$probability)
#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m_final, K = 10)
#average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2774869

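10-fold cross-validation gives a good estimate of the actual predictive power of the model; the lower the value, the better. Here the cross-validated error (~0.28) is practically the same as the training error.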
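To show what K-fold cross-validation does, here is a simplified hand-written version of the procedure. It is a sketch on simulated data, not a reimplementation of cv.glm (whose internals differ in detail); `sim`, `fold` and `fold_error` are names invented for the example, and the loss function is the same one defined earlier.

```r
set.seed(42)

# Simulated stand-in data with one predictor and a binary outcome
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

# Same loss as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

K <- 10
fold <- sample(rep(1:K, length.out = n))   # random fold label for each row

# Fit on K - 1 folds, measure the error on the held-out fold
fold_error <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, data = sim[fold != k, ], family = "binomial")
  prob <- predict(fit, newdata = sim[fold == k, ], type = "response")
  loss_func(sim$y[fold == k], prob)
})

cv_error <- mean(fold_error)
cv_error
```

Every observation is used for testing exactly once, and never by a model that saw it during fitting, which is why the estimate is more honest than the training error.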


RStudio exercise 4: Clustering and classification

Introduction to the data

In this exercise we use the Boston data from the MASS package. The data set contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The data include 14 variables and 506 rows.

## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

variable   description
crim       per capita crime rate by town
zn         proportion of residential land zoned for lots over 25,000 sq.ft.
indus      proportion of non-retail business acres per town
chas       Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox        nitrogen oxides concentration (parts per 10 million)
rm         average number of rooms per dwelling
age        proportion of owner-occupied units built prior to 1940
dis        weighted mean of distances to five Boston employment centres
rad        index of accessibility to radial highways
tax        full-value property-tax rate per $10,000
ptratio    pupil-teacher ratio by town
black      1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat      lower status of the population (percent)
medv       median value of owner-occupied homes in $1000

Graphical overview of the data

Plot matrix of the data

There are some very interesting distributions of variables in the plot matrix. Variable rad has both high and low values, so the plot shows the values concentrated at either side of the plot. Variable *

Plotted correlation matrix

The plotted correlation matrix shows that there are some high correlations between variables:

  • The correlation is quite clear between industrial areas (indus) and nitrogen oxides (nox): industry adds pollution to the area. Industry also seems to correlate with variables such as age, dis, rad and tax.

  • The correlations of nitrogen oxides (nox) are very similar to those of industry (indus).

  • Crime rate (crim) seems to correlate with good accessibility to radial highways (rad) and with the property-tax rate (tax).

  • The proportion of old houses (age) and the distance to employment centres (dis) also have something in common.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Scaled data

All the variables are numerical, so we can use the scale() function to scale the whole data set.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
## [1] "matrix"

Scaling centres every variable at zero and puts the variables on a comparable range. Before scaling, the values of variables such as black and tax were a hundredfold larger than those of some other variables.
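With its default arguments, scale() subtracts the column mean and divides by the column standard deviation. The sketch below verifies this on a small made-up matrix; the column names echo the Boston variables but the values are for illustration only.

```r
# Toy matrix with columns on very different scales (values are made up)
x <- cbind(tax = c(296, 242, 222, 311, 666),
           nox = c(0.538, 0.469, 0.458, 0.524, 0.713))

# Manual standardisation: (x - column mean) / column standard deviation
centred <- sweep(x, 2, colMeans(x))
manual  <- sweep(centred, 2, apply(x, 2, sd), "/")

scaled <- scale(x)

all.equal(manual, scaled, check.attributes = FALSE)  # same numbers
round(colMeans(scaled), 10)                          # means are 0
apply(scaled, 2, sd)                                 # standard deviations are 1
```

scale() returns a matrix rather than a data frame, which is why the class shown above is "matrix"; it has to be converted back with as.data.frame() before use in functions that expect a data frame.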

Creating a new categorical variable crime

Variable crim is the base of the new categorical variable crime.

categories   quantile points
low          0%-25%
med_low      25%-50%
med_high     50%-75%
high         75%-100%

Quantile points of the variable crim

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
## crime
##      low  med_low med_high     high 
##      127      126      126      127
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
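The code for this step is not shown in the report. One common way to build such a variable is cut() with the quantiles as break points, sketched below on a simulated stand-in for the scaled crim variable (the real values come from the Boston data; `crim_scaled` and `bins` are illustrative names).

```r
set.seed(7)

# Simulated, skewed stand-in for the scaled crim variable (506 values)
crim_scaled <- as.numeric(scale(rexp(506)))

# Break points at 0%, 25%, 50%, 75% and 100%
bins <- quantile(crim_scaled)

# include.lowest = TRUE keeps the minimum value inside the first class
crime <- cut(crim_scaled, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)
```

Because the break points are quartiles, the four classes end up with nearly equal sizes, as in the counts printed above.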

Train and test sets

Training set contains 80% of the data. 20% is in the test set.

##   [1] 186 322 180 417 485 280 379  81 168 103 247 499 391 116  20 502 320
##  [18] 427 355 178 245 413  47   6 197  67  97 396 269 130 403 104 136 315
##  [35] 284 412 241 430  72  62 100 123 258 167 375  52 327 240 101 364 236
##  [52] 331 195 234 279 432 409 411 345  45  43  21 368 131 382 233 108 125
##  [69]  99 446 264 281 303 344 129 333  10 348 297 451 156  51  83  25  22
##  [86] 468 469 250  13 455 183 203 490 217 349 360  26 481 302 329 124 133
## [103] 314 299 254  93 387 201  50 270  15 184 450 211 448 480 321 121 418
## [120] 335 259  82 188 248  63 138 429 489 137 500 179 208 230 443  87 504
## [137] 416 228 363 251 486 440 330 164 414 206 398  86 339 185 341 143 479
## [154] 224  95 290 135 161 255 300 456 126 505 239  96 421 488 393 193 397
## [171] 165 476 464 216 316  38 177 214 420 374  65 115 497 118 166 110 252
## [188] 160 503 328 210 438 337 353 453 493 127 260 286 401 380 447 466 332
## [205] 189 105  78 305 154 225 190 482 266 484 212  60 242 227 122 243 491
## [222] 147  73 483 113 383 437  76 276  46 237 268 288 257 318 467  49 338
## [239] 428  31  61 146 326 296 176 498  24 287 202 200 307 191  37  94 173
## [256] 253 370  42 340 235  75   1 192  68 159 155 462 207 244 439  34  16
## [273] 319  39  77  35 132 238 436 170 285 205 277 419 140 282 148 372 334
## [290] 354 218  71  92 367 442 325 487 199 474 312 219 204 182  53 465 475
## [307]  28  57   9  54   7 232  66 220 323 142   8 386 405 151 471 388  80
## [324] 392 272 152 400 317 229 162 308 452 385 457  44  18 292 256 149 112
## [341] 313 449 271 306 358  32 362 107 463 106 458  14   2 441 114 431 501
## [358] 359 425 445 310 389  29 477 150 373 134 301 181 394  90 492 460 336
## [375] 495  85 246  58 311 384 406 249 294 119  23 304 175 293 215 295 261
## [392]  12 342  36 369 102 172 291 444 346 283 265 198 365
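The printed numbers are the randomly chosen row indices of the training set. A minimal sketch of such a split follows; the seed and the object names are assumptions made for the example, so the indices will differ from the ones above.

```r
set.seed(123)                              # arbitrary seed, for reproducibility

n <- 506                                   # rows in the Boston data
ind <- sample(n, size = floor(n * 0.8))    # row indices for the training set

length(ind)                                # 404 training rows
test_rows <- setdiff(seq_len(n), ind)      # remaining rows form the test set
length(test_rows)                          # 102 test rows
```

With a data frame `d`, the two sets would then be `d[ind, ]` and `d[-ind, ]`.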

Fitting the Linear Discriminant Analysis

First the linear discriminant analysis (LDA) is fitted to the train set. The new categorical variable crime is the target variable and all the other variables of the dataset are predictor variables.

After fitting we draw the LDA biplot with arrows.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2574257 0.2500000 0.2376238 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.98889890 -0.8846501 -0.15765625 -0.8687426  0.40827464
## med_low  -0.09969684 -0.2674719 -0.08304540 -0.5643705 -0.14716922
## med_high -0.38365582  0.2003265  0.11748284  0.3809752  0.08160335
## high     -0.48724019  1.0149946 -0.06727176  1.0521399 -0.41711463
##                 age        dis        rad        tax    ptratio      black
## low      -0.8989016  0.8521780 -0.6741266 -0.7486137 -0.5135669  0.3770142
## med_low  -0.3258397  0.3374276 -0.5456746 -0.4736838 -0.1082613  0.3094517
## med_high  0.3834406 -0.3572742 -0.4144603 -0.3025424 -0.2305360  0.1111525
## high      0.7955508 -0.8361604  1.6596029  1.5294129  0.8057784 -0.8466439
##                lstat         medv
## low      -0.74806739  0.506935581
## med_low  -0.14471908  0.001451274
## med_high  0.01121753  0.136705280
## high      0.93200127 -0.699891284
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.13853250  0.8046371288 -0.96451679
## indus    0.04391570 -0.1680047542  0.26465188
## chas    -0.08371156 -0.0372860741  0.08552400
## nox      0.22055079 -0.7841836940 -1.54242664
## rm      -0.10710158 -0.1299699406 -0.17553623
## age      0.25615162 -0.2961059881  0.02677354
## dis     -0.12284019 -0.3566217762  0.20012963
## rad      3.44990822  1.0867302850 -0.05182276
## tax     -0.01093013 -0.2499253231  0.66756378
## ptratio  0.12045867  0.0153931459 -0.38730853
## black   -0.14829277 -0.0007841421  0.07953713
## lstat    0.22648702 -0.2292112397  0.27745623
## medv     0.18076295 -0.3947858497 -0.23971095
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9513 0.0359 0.0128
##   [1] 1 2 1 4 3 2 4 1 3 2 3 2 4 2 3 1 3 4 1 1 2 4 2 1 1 1 2 4 3 3 4 2 3 3 1
##  [36] 4 2 4 2 2 1 2 3 3 4 1 3 2 2 4 3 1 1 3 1 4 4 4 1 2 2 3 4 3 4 3 2 2 1 4
##  [71] 3 1 2 1 3 1 2 1 1 4 3 2 1 3 3 4 4 2 2 4 2 1 2 1 1 4 3 4 1 1 2 3 3 1 3
## [106] 1 4 1 2 2 3 2 4 2 4 4 2 1 4 1 3 1 1 2 2 3 4 2 3 2 1 2 3 4 1 1 4 3 4 2
## [141] 3 4 1 3 4 2 4 1 1 2 1 3 4 3 1 1 3 3 1 1 4 2 2 2 2 4 4 4 2 4 3 4 4 2 2
## [176] 1 1 2 4 4 1 2 3 2 3 3 2 3 1 2 3 4 1 1 4 2 3 3 1 4 4 4 3 1 2 2 2 1 3 3
## [211] 2 4 3 3 3 2 2 3 1 2 2 3 2 4 2 4 4 2 2 2 3 3 1 1 2 4 2 1 4 3 2 3 2 2 1
## [246] 3 3 1 1 1 1 2 2 1 2 2 4 2 1 3 1 1 1 1 3 3 4 2 2 4 3 3 3 2 2 3 3 3 4 3
## [281] 1 1 2 4 3 1 3 4 1 1 1 2 1 4 4 3 4 1 4 3 2 1 1 1 4 4 3 1 2 1 2 3 1 2 3
## [316] 3 2 4 4 3 4 4 2 4 2 3 4 3 3 3 1 4 4 4 2 3 1 1 3 2 3 4 3 1 4 3 4 2 4 2
## [351] 4 3 1 4 2 4 2 4 4 4 3 4 3 4 3 4 3 1 1 4 1 2 4 1 3 1 2 1 3 4 4 2 2 2 3
## [386] 2 2 1 3 1 3 2 1 1 4 2 3 1 4 1 1 3 1 3

Predicting the classes

##           predicted
## correct    low med_low med_high high
##   low       14       9        1    0
##   med_low    4      14        4    0
##   med_high   1       8       15    1
##   high       0       0        1   30

The predictions were quite good. There were some errors in the middle of the range, but the classes low and especially high were predicted well. Only one observation from the true high class was misclassified, and it was predicted to be in the med_high class.
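The cross tabulation can be summarised as an overall accuracy and a per-class hit rate. The sketch below re-enters the table by hand; `conf`, `accuracy` and `recall` are names introduced for the example.

```r
classes <- c("low", "med_low", "med_high", "high")

# Counts copied from the table above, entered column by column:
# rows = correct class, columns = predicted class
conf <- matrix(c(14,  4,  1,  0,
                  9, 14,  8,  0,
                  1,  4, 15,  1,
                  0,  0,  1, 30),
               nrow = 4, dimnames = list(correct = classes, predicted = classes))

accuracy <- sum(diag(conf)) / sum(conf)   # share of correct predictions
recall <- diag(conf) / rowSums(conf)      # share found within each true class

round(accuracy, 3)
round(recall, 3)
```

The per-class rates make the pattern in the text visible: the high class is recovered almost completely, while the three lower classes are mixed up with their neighbours more often.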

K-means algorithm

I’m going to find the optimal number of clusters for the Boston data. First I reload and scale the data. The variables need to be scaled to get comparable distances between observations.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
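
A minimal sketch of how a scaled summary like the one above could be produced, assuming the Boston data comes from the MASS package. The object name scaled_Boston is borrowed from the LDA call later in this diary; the exact code used here is not shown, so treat this as one possible version.

```r
library(MASS)
data("Boston")

# scale() centers every column to mean 0 and divides it by its standard
# deviation. It returns a matrix, so it is converted back to a data frame.
scaled_Boston <- as.data.frame(scale(Boston))

# Every mean should now be 0 and every variable is on a comparable scale
summary(scaled_Boston)
```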

Next I calculate the distances between observations and determine the number of clusters.

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. The total WCSS was calculated for 1 to 15 clusters. The optimal number of clusters is the point where the total WCSS drops sharply. In this case the optimal number of clusters seems to be two. However, we are here to learn, so I decided to analyse a model with four clusters.
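
The WCSS comparison can be sketched roughly as follows. The seed, nstart and iter.max values are my own assumptions, added only to make the result repeatable and stable; they are not necessarily the settings used for the analysis.

```r
library(MASS)
data("Boston")
scaled_Boston <- as.data.frame(scale(Boston))

# K-means starts from random centers, so fix the seed (assumed value)
set.seed(123)

k_max <- 15

# Total within-cluster sum of squares for k = 1, ..., 15 clusters
twcss <- sapply(1:k_max, function(k) {
  kmeans(scaled_Boston, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Look for the k where the curve drops sharply and then flattens
plot(1:k_max, twcss, type = "b",
     xlab = "Number of clusters", ylab = "Total WCSS")
```

With one cluster the total WCSS equals the total sum of squares of the scaled data, which gives the curve its starting point.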

After determining the number of clusters I run the K-means algorithm again.

When the data is divided into four clusters, there are clear differences in the distributions of several variables. Crim, zn, indus and black are variables where one can distinguish clear patterns between the clusters. For some variables (rad and tax) one or two clusters form a clear distribution, but the observations of the other clusters are ambiguous and no clear pattern can be recognised.
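
A sketch of the four-cluster run and a pairs plot restricted to the variables discussed above. The seed and nstart are assumptions, and because K-means depends on its random start, the cluster numbering and sizes may differ from the ones behind this text.

```r
library(MASS)
data("Boston")
scaled_Boston <- as.data.frame(scale(Boston))

set.seed(123)  # assumed seed, only for repeatability
km <- kmeans(scaled_Boston, centers = 4, nstart = 10)

# Number of observations in each cluster
table(km$cluster)

# Scatter plot matrix of the variables named in the text, colored by cluster
vars <- c("crim", "zn", "indus", "black", "rad", "tax")
pairs(scaled_Boston[, vars], col = km$cluster, pch = 19, cex = 0.5)
```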

BONUS: LDA using clusters as target classes

After loading the Boston dataset I scale it to get comparable distances.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv             clust      
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063   Min.   :1.000  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989   1st Qu.:2.000  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449   Median :3.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   :2.674  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683   3rd Qu.:3.000  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865   Max.   :4.000

The original Boston dataset is now scaled and the result of the K-means clustering is saved to the variable clust.

LDA with the clusters

Next the LDA is performed and a biplot with arrows is created.

## Call:
## lda(clust ~ ., data = scaled_Boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2114625 0.1304348 0.4308300 0.2272727 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
## 2  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
## 3 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 4  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
##          age        dis        rad        tax     ptratio       black
## 1 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
## 2  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
## 3 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 4  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
##        lstat        medv
## 1 -0.8200275  1.11919598
## 2  1.1930953 -0.81904111
## 3 -0.2813666 -0.01314324
## 4  0.6116223 -0.54636549
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim     0.18113078 -0.5012256 -0.60535205
## zn       0.43297497 -1.0486194  0.67406151
## indus    1.37753200  0.3016928  1.07034034
## chas    -0.04307937 -0.7598229 -0.22448239
## nox      1.04674638 -0.3861005 -0.33268952
## rm      -0.14912869 -0.1510367  0.67942589
## age     -0.09897424  0.0523110  0.26285587
## dis      0.13139210 -0.1593367 -0.03487882
## rad      0.65824136  0.5189795  0.48145070
## tax      0.28903561 -0.5773959  0.10350513
## ptratio  0.22236843  0.1668597 -0.09181715
## black   -0.42730704  0.5843973  0.89869354
## lstat    0.24320629 -0.6197780 -0.01119242
## medv     0.21961575 -0.9485829 -0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636

The biplot shows that the variables indus, zn and medv are the most influential separators for the clusters.
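
A rough sketch of an LDA biplot with arrows, using the K-means clusters as the target. Several things here are my assumptions and not the code behind the output above: the seed, the name lda_fit, and dropping chas as a precaution, because lda() stops with an error when a predictor is constant within the groups, which can happen with a binary dummy. The clusters, and therefore the coefficients, will not match the printed output exactly.

```r
library(MASS)
data("Boston")
scaled_Boston <- as.data.frame(scale(Boston))

# ASSUMPTION: drop the binary chas dummy so that no predictor can end up
# constant within the clusters (lda() would refuse to fit in that case)
scaled_Boston$chas <- NULL

set.seed(123)  # assumed seed
scaled_Boston$clust <- kmeans(scaled_Boston, centers = 4, nstart = 10)$cluster

# Clusters as the target classes, all other variables as predictors
lda_fit <- lda(clust ~ ., data = scaled_Boston)

# Observation scores on the linear discriminants
scores <- predict(lda_fit, newdata = scaled_Boston)$x
plot(scores[, 1], scores[, 2], col = scaled_Boston$clust,
     xlab = "LD1", ylab = "LD2")

# One arrow per variable, drawn from the origin to its LD1/LD2 coefficients
coefs <- lda_fit$scaling
arrows(0, 0, coefs[, 1], coefs[, 2], length = 0.1, col = "red")
text(coefs[, 1], coefs[, 2], labels = rownames(coefs),
     pos = 3, cex = 0.75, col = "red")

# Arrow lengths in the LD1-LD2 plane: longer arrows separate the clusters more
sort(sqrt(rowSums(coefs[, 1:2]^2)), decreasing = TRUE)
```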

Super-bonus

3D plot where the color of the observations is based on the crime classes of the train set.

3D plot where the color of the observations is based on the K-means clusters.

The colors in both plots are based on four classes. The K-means plot seems to show the different clusters more clearly than the plot based on the crime classification.


RStudio Exercise 5: Dimensionality reduction techniques

library(MASS)
library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
library(plotly)

Introduction to the data

The data of this exercise originates from the United Nations Development Programme. Human development index (HDI) was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone.

The original data can be found at: http://hdr.undp.org/en/content/human-development-index-hdi

The modified dataset includes 155 observations and 9 fields (8 variables plus the country name, which is stored as the row name):

variable         description
country          Country name (name of the row)
gni_capita       Gross national income per capita
life_exp         Life expectancy at birth
edu_years        Expected years of schooling
mortality        Maternal mortality ratio
young_mom        Adolescent birth rate
women_parlament  Percentage of female representatives in parliament
edu_ratio        Ratio of females to males with at least secondary education
labour_ratio     Ratio of females to males in the labour force

Summary of the data

## 'data.frame':    155 obs. of  8 variables:
##  $ gni_capita     : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ life_exp       : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ edu_years      : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ mortality      : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ young_mom      : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ women_parlament: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ edu_ratio      : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labour_ratio   : num  0.891 0.819 0.825 0.884 0.829 ...

All the variables are numeric (stored either as num or int). Two of the variables are ratios (edu_ratio, labour_ratio).

Summary, distributions of the variables and the relationships between them

##    gni_capita        life_exp       edu_years       mortality     
##  Min.   :   581   Min.   :49.00   Min.   : 5.40   Min.   :   1.0  
##  1st Qu.:  4198   1st Qu.:66.30   1st Qu.:11.25   1st Qu.:  11.5  
##  Median : 12040   Median :74.20   Median :13.50   Median :  49.0  
##  Mean   : 17628   Mean   :71.65   Mean   :13.18   Mean   : 149.1  
##  3rd Qu.: 24512   3rd Qu.:77.25   3rd Qu.:15.20   3rd Qu.: 190.0  
##  Max.   :123124   Max.   :83.50   Max.   :20.20   Max.   :1100.0  
##    young_mom      women_parlament   edu_ratio       labour_ratio   
##  Min.   :  0.60   Min.   : 0.00   Min.   :0.1717   Min.   :0.1857  
##  1st Qu.: 12.65   1st Qu.:12.40   1st Qu.:0.7264   1st Qu.:0.5984  
##  Median : 33.60   Median :19.30   Median :0.9375   Median :0.7535  
##  Mean   : 47.16   Mean   :20.91   Mean   :0.8529   Mean   :0.7074  
##  3rd Qu.: 71.95   3rd Qu.:27.95   3rd Qu.:0.9968   3rd Qu.:0.8535  
##  Max.   :204.80   Max.   :57.50   Max.   :1.4967   Max.   :1.0380

There is great variation in scale between the numeric variables. Gross national income per capita (gni_capita) ranges from $581 (min) to $123,124 (max), while two variables (edu_ratio, labour_ratio) are ratios that fall on both sides of 1. Life expectancy (life_exp) and expected years of schooling (edu_years) are measured in years, and their maximum values are under 100. The percentage of female representatives in parliament (women_parlament) is the only variable expressed as a percentage.

The maternal mortality ratio (mortality) correlates strongly and negatively with life expectancy, expected years of schooling and the education ratio between females and males. Maternal mortality and the adolescent birth rate appear to be connected too. One could conclude that in some countries uneducated females give birth at a very young age, and that this leads to a high maternal mortality ratio.

Gross national income per capita appears to be connected to life expectancy and expected years of schooling. Could it be that an educated labour force helps a society to prosper economically?

The share of women in parliament seems to correlate with the education ratio between females and males. It seems that gender equality starts from education and enables both genders to participate in social decision-making.

Principal component analysis (PCA)

First we conduct PCA with unstandardized data.

## Standard deviations:
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation:
##                           PC1           PC2           PC3           PC4
## gni_capita      -9.999832e-01 -0.0057723054  5.156742e-04 -4.932889e-05
## life_exp        -2.815823e-04  0.0283150248 -1.294971e-02  6.752684e-02
## edu_years       -9.562910e-05  0.0075529759 -1.427664e-02  3.313505e-02
## mortality        5.655734e-03 -0.9916320120 -1.260302e-01  6.100534e-03
## young_mom        1.233961e-03 -0.1255502723  9.918113e-01 -5.301595e-03
## women_parlament -5.526460e-05  0.0032317269  7.398331e-03  9.971232e-01
## edu_ratio       -5.607472e-06  0.0006713951  3.412027e-05  2.736326e-04
## labour_ratio     2.331945e-07 -0.0002819357 -5.302884e-04  4.692578e-03
##                           PC5           PC6           PC7           PC8
## gni_capita      -0.0001135863  2.711698e-05  8.075191e-07 -1.176762e-06
## life_exp         0.9865644425  1.453515e-01 -5.380452e-03  2.281723e-03
## edu_years        0.1431180282 -9.882477e-01  3.826887e-02  7.776451e-03
## mortality        0.0266373214 -1.695203e-03 -1.355518e-04  8.371934e-04
## young_mom        0.0188618600 -1.273198e-02  8.641234e-05 -1.707885e-04
## women_parlament -0.0716401914  2.309896e-02  2.642548e-03  2.680113e-03
## edu_ratio       -0.0022935252 -2.180183e-02 -6.998623e-01  7.139410e-01
## labour_ratio     0.0022190154 -3.264423e-02 -7.132267e-01 -7.001533e-01
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

The plot doesn't look good. The large variance of the gross national income per capita variable makes it dominate the principal component analysis. The data needs to be standardized.
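
The effect of scale on PCA can be reproduced with a small simulated example. The data frame toy and its columns are made up for illustration and are not the UNDP data; they only mimic one variable measured in the tens of thousands next to variables with much smaller spreads.

```r
set.seed(1)
n <- 155
z <- rnorm(n)

# Simulated stand-ins (NOT the UNDP data) on very different scales
toy <- data.frame(
  income = 17000 + 18000 * z,                    # sd in the tens of thousands
  years  = 70 + 8 * (0.6 * z + 0.8 * rnorm(n)),  # sd of about 8, related to income
  ratio  = 0.7 + 0.2 * rnorm(n)                  # sd of about 0.2, unrelated
)

pca_raw <- prcomp(toy)                 # raw variances: income dominates PC1
pca_std <- prcomp(toy, scale. = TRUE)  # every variable scaled to unit variance

# Share of variance per component, without and with standardization
summary(pca_raw)$importance["Proportion of Variance", ]
summary(pca_std)$importance["Proportion of Variance", ]

# Loadings of PC1 on the raw data: almost entirely the income variable
pca_raw$rotation[, 1]
```

On the raw data the first component is essentially the income axis, which is the same pattern as in the unstandardized output above, where PC1 loads almost only on gni_capita.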

Summary of the standardized variables

##    gni_capita         life_exp         edu_years         mortality      
##  Min.   :-0.9193   Min.   :-2.7188   Min.   :-2.7378   Min.   :-0.6992  
##  1st Qu.:-0.7243   1st Qu.:-0.6425   1st Qu.:-0.6782   1st Qu.:-0.6496  
##  Median :-0.3013   Median : 0.3056   Median : 0.1140   Median :-0.4726  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.6717   3rd Qu.: 0.7126   3rd Qu.: 0.1932  
##  Max.   : 5.6890   Max.   : 1.4218   Max.   : 2.4730   Max.   : 4.4899  
##    young_mom       women_parlament     edu_ratio        labour_ratio    
##  Min.   :-1.1325   Min.   :-1.8203   Min.   :-2.8189   Min.   :-2.6247  
##  1st Qu.:-0.8394   1st Qu.:-0.7409   1st Qu.:-0.5233   1st Qu.:-0.5484  
##  Median :-0.3298   Median :-0.1403   Median : 0.3503   Median : 0.2316  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6030   3rd Qu.: 0.6127   3rd Qu.: 0.5958   3rd Qu.: 0.7350  
##  Max.   : 3.8344   Max.   : 3.1850   Max.   : 2.6646   Max.   : 1.6632

PCA with standardized data

## Standard deviations:
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation:
##                         PC1         PC2         PC3         PC4        PC5
## gni_capita      -0.35048295 -0.05060876  0.20168779 -0.72727675  0.4950306
## life_exp        -0.44372240  0.02530473 -0.10991305 -0.05834819 -0.1628935
## edu_years       -0.42766720 -0.13940571  0.07340270 -0.07020294 -0.1659678
## mortality        0.43697098 -0.14508727  0.12522539 -0.25170614  0.1800657
## young_mom        0.41126010 -0.07708468 -0.01968243  0.04986763  0.4672068
## women_parlament -0.08438558 -0.65136866 -0.72506309  0.01396293  0.1523699
## edu_ratio       -0.35664370 -0.03796058  0.24223089  0.62678110  0.5983585
## labour_ratio     0.05457785 -0.72432726  0.58428770  0.06199424 -0.2625067
##                         PC6         PC7         PC8
## gni_capita      -0.11120305 -0.13711838 -0.16961173
## life_exp         0.42242796 -0.43406432  0.62737008
## edu_years        0.38606919  0.77962966 -0.05415984
## mortality       -0.17370039  0.35380306  0.72193946
## young_mom        0.76056557 -0.06897064 -0.14335186
## women_parlament -0.13749772  0.00568387 -0.02306476
## edu_ratio       -0.17713316  0.05773644  0.16459453
## labour_ratio     0.03500707 -0.22729927 -0.07304568
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
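
As a check on how to read this output, the "Proportion of Variance" row can be recomputed from the printed standard deviations. This is a small worked example using only the numbers shown above; the vector names are my own.

```r
# Standard deviations copied from the standardized PCA output
sdev <- c(2.0708380, 1.1397204, 0.8750485, 0.7788630,
          0.6619563, 0.5363061, 0.4589994, 0.3222406)

variances  <- sdev^2                      # variance captured by each component
proportion <- variances / sum(variances)  # share of the total variance
cumulative <- cumsum(proportion)          # running total over the components

round(rbind(proportion, cumulative), 4)

# The total is close to 8: one unit of variance per standardized variable
sum(variances)
```

The first two components together account for roughly 70 % of the total variance, in line with the cumulative proportion printed for PC2.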